AITopics | teacher model

Collaborating Authors

teacher model

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Deep Insights into Noisy Pseudo Labeling on Graph Data

Neural Information Processing SystemsApr-30-2026, 06:24:49 GMT

Pseudo labeling (PL) is a wide-applied strategy to enlarge the labeled dataset by self-annotating the potential samples during the training process. Several works have shown that it can improve the graph learning model performance in general. However, we notice that the incorrect labels can be fatal to the graph training process. Inappropriate PL may result in the performance degrading, especially on graph data where the noise can propagate. Surprisingly, the corresponding error is seldom theoretically analyzed in the literature.

artificial intelligence, machine learning, prediction, (16 more...)

Neural Information Processing Systems

Country: Asia > China (0.46)

Industry: Information Technology > Security & Privacy (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

Knowledge Distillation Performs Partial Variance Reduction

Neural Information Processing SystemsApr-30-2026, 05:37:23 GMT

Knowledge distillation is a popular approach for enhancing the performance of "student" models, with lower representational capacity, by taking advantage of more powerful "teacher" models. Despite its apparent simplicity and widespread use, the underlying mechanics behind knowledge distillation (KD) are still not fully understood. In this work, we shed new light on the inner workings of this method, by examining it from an optimization perspective. We show that, in the context of linear and deep linear models, KD can be interpreted as a novel type of stochastic variance reduction mechanism. We provide a detailed convergence analysis of the resulting dynamics, which hold under standard assumptions for both strongly-convex and non-convex losses, showing that KD acts as a form of partial variance reduction, which can reduce the stochastic gradient noise, but may not eliminate it completely, depending on the properties of the "teacher" model. Our analysis puts further emphasis on the need for careful parametrization of KD, in particular w.r.t. the weighting of the distillation loss, and is validated empirically on both linear models and deep neural networks.

artificial intelligence, distillation, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Add feedback

Distance OP

Neural Information Processing SystemsApr-29-2026, 23:57:32 GMT

Conventional KD methods propose various designs to allow student model to imitate the teacher better. However, these MultiScale handcrafted KD designs heavily rely on expert knowledge and may be sub-optimal for various teacher-student pairs.

distillation, evolutionary algorithm, machine learning, (19 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.46)

Industry: Education (0.88)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.71)
Information Technology > Artificial Intelligence > Cognitive Science (0.70)
(3 more...)

Add feedback

SLaM: Student-Label Mixing for Distillation with Unlabeled Examples

Neural Information Processing SystemsApr-29-2026, 22:04:00 GMT

Knowledge distillation with unlabeled examples is a powerful training paradigm for generating compact and lightweight student models in applications where the amount of labeled data is limited but one has access to a large pool of unlabeled data. In this setting, a large teacher model generates "soft" pseudo-labels for the unlabeled dataset which are then used for training the student model. Despite its success in a wide variety of applications, a shortcoming of this approach is that the teacher's pseudo-labels are often noisy, leading to impaired student performance. In this paper, we present a principled method for knowledge distillation with unlabeled examples that we call Student-Label Mixing (SLaM) and we show that it consistently improves over prior approaches by evaluating it on several standard benchmarks. Finally, we show that SLaM comes with theoretical guarantees; along the way we give an algorithm improving the best-known sample complexity for learning halfspaces with margin under random classification noise, and provide the first convergence analysis for so-called "forward loss-adjustment" methods.

artificial intelligence, dataset, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

Supplementary Materials for the Paper " L2T-DLN: Learning to Teach with Dynamic Loss Network "

Neural Information Processing SystemsApr-28-2026, 21:43:01 GMT

In this supplementary material, we provide the proofs of convergence analysis in Section 1, 1-vs-1 transformation employed in the classification and semantic segmentation tasks in Section 2, the coordinate-wise and the preprocessing method of the LSTM teacher in Section 3, the loss functions of YOLO-v3 in Section 4, more experiments of image classification in Section 5, and the inferences of semantic segmentation in Section 6. A differentiable function e()is L-smooth with gradient Lipschitz constant C (uniformly Lipschitz continuous), if e(x) e(y) C x y, x,y. The function is called block-wise smooth with gradient Lipschitz Ci, if i e(x i,xi) ie(x i,x i) Ci xi x i, x,x (1) or with gradient Lipschitz constants { Ci}, if i e(x i,xi) ie(x i,xi) Ci x i x i, x,x (2) Further, Let Gmax max{Ci, Ci, k} C. Definition 2. For a differentiable function e(), if e(x) = 0, then x is a first-order stationary solution (SS1). For a differentiable function e(), if x is a SS1, and there exists ϵ > 0 so that for any y in the ϵ-neighborhood of x, we have e(x) e(y), then xis a local minimum. A saddle point xis an SS1 that is not a local minimum. If λmin( 2e(x)) < 0, x is a strict (non-degenerate) saddle point.

artificial intelligence, machine learning, student model, (18 more...)

Neural Information Processing Systems

Country: Asia > China (0.14)

Genre: Research Report > New Finding (0.47)

Industry: Education (0.32)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback

L2T-DLN: Learning to Teach with Dynamic Loss Network

Neural Information Processing SystemsApr-28-2026, 21:42:58 GMT

With the concept of teaching being introduced to the machine learning community, a teacher model start using dynamic loss functions to teach the training of a student model. The dynamic intends to set adaptive loss functions to different phases of student model learning. In existing works, the teacher model 1) merely determines the loss function based on the present states of the student model, i.e., disregards the experience of the teacher; 2) only utilizes the states of the student model, e.g., training iteration number and loss/accuracy from training/validation sets, while ignoring the states of the loss function. In this paper, we first formulate the loss adjustment as a temporal task by designing a teacher model with memory units, and, therefore, enables the student learning to be guided by the experience of the teacher model. Then, with a dynamic loss network, we can additionally use the states of the loss to assist the teacher learning in enhancing the interactions between the teacher and the student model. Extensive experiments demonstrate our approach can enhance student learning and improve the performance of various deep models on real-world tasks, including classification, objective detection, and semantic segmentation scenarios.

artificial intelligence, loss function, machine learning, (17 more...)

Neural Information Processing Systems

Country:

Europe (0.28)
Asia > China (0.14)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

233278d812e74a4f9848410881db86b1-Paper-Conference.pdf

Neural Information Processing SystemsApr-25-2026, 21:42:08 GMT

artificial intelligence, machine learning, robustness, (18 more...)

Neural Information Processing Systems

Country: Asia > China (0.46)

Genre: Research Report > New Finding (0.67)

Industry:

Education (0.70)
Information Technology > Security & Privacy (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Revisit the Power of Vanilla Knowledge Distillation: from Small Scale to Large Scale

Neural Information Processing SystemsApr-25-2026, 18:54:15 GMT

The tremendous success of large models trained on extensive datasets demonstrates that scale is a key ingredient in achieving superior results. Therefore, the reflection on the rationality of designing knowledge distillation (KD) approaches for limited-capacity architectures solely based on small-scale datasets is now deemed imperative. In this paper, we identify the small data pitfall that presents in previous KD methods, which results in the underestimation of the power of vanilla KD framework on large-scale datasets such as ImageNet-1K. Specifically, we show that employing stronger data augmentation techniques and using larger datasets can directly decrease the gap between vanilla KD and other meticulously designed KD variants. This highlights the necessity of designing and evaluating KD approaches in the context of practical scenarios, casting off the limitations of small-scale datasets. Our investigation of the vanilla KD and its variants in more complex schemes, including stronger training strategies and different model capacities, demonstrates that vanilla KD is elegantly simple but astonishingly effective in large-scale scenarios. Without bells and whistles, we obtain state-of-the-art ResNet50, ViT-S, and ConvNeXtV2-T models for ImageNet, which achieve 83.1%, 84.3%, and 85.0% top-1 accuracy, respectively.

artificial intelligence, machine learning, natural language, (15 more...)

Neural Information Processing Systems

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.68)

Industry: Education (0.69)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Filters

Collaborating Authors

teacher model

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Deep Insights into Noisy Pseudo Labeling on Graph Data

Knowledge Distillation Performs Partial Variance Reduction

b83bea9688047be30f54034c55716854-Supplemental-Datasets_and_Benchmarks_Track.pdf

Distance OP

SLaM: Student-Label Mixing for Distillation with Unlabeled Examples

Supplementary Materials for the Paper " L2T-DLN: Learning to Teach with Dynamic Loss Network "

L2T-DLN: Learning to Teach with Dynamic Loss Network

233278d812e74a4f9848410881db86b1-Paper-Conference.pdf

4ca82782c5372a547c104929f03fe7a9-Supplemental.pdf

Revisit the Power of Vanilla Knowledge Distillation: from Small Scale to Large Scale